Making Digital Artifacts on the Web Verifiable and Reliable
The current Web has no general mechanisms to make digital artifacts --- such
as datasets, code, texts, and images --- verifiable and permanent. For digital
artifacts that are supposed to be immutable, there is moreover no commonly
accepted method to enforce this immutability. These shortcomings have a serious
negative impact on the ability to reproduce the results of processes that rely
on Web resources, which in turn heavily impacts areas such as science where
reproducibility is important. To solve this problem, we propose trusty URIs
containing cryptographic hash values. We show how trusty URIs can be used for
the verification of digital artifacts, in a manner that is independent of the
serialization format in the case of structured data files such as
nanopublications. We demonstrate how the contents of these files become
immutable, including their dependencies on external digital artifacts, thereby
extending the range of verifiability to the entire reference tree. Our approach
sticks to the core principles of the Web, namely openness and decentralized
architecture, and is fully compatible with existing standards and protocols.
Evaluation of our reference implementations shows that these design goals are
indeed accomplished by our approach, and that it remains practical even for
very large files.
Comment: Extended version of conference paper: arXiv:1401.577
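The core idea of the abstract — a URI that carries a cryptographic hash of the artifact it identifies, so anyone holding the artifact can verify it — can be sketched as follows. This is an illustrative sketch only: the function names and the `.`-separated suffix format are assumptions, not the exact trusty-URI encoding defined by the authors.

```python
import base64
import hashlib

def make_trusty_uri(base_uri: str, content: bytes) -> str:
    """Append a URL-safe hash of the content to the URI.

    Illustrative only; the paper's actual trusty-URI encoding differs.
    """
    digest = hashlib.sha256(content).digest()
    suffix = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{base_uri}.{suffix}"

def verify(trusty_uri: str, content: bytes) -> bool:
    """Recompute the hash of the content and compare it to the URI suffix."""
    base, _, suffix = trusty_uri.rpartition(".")
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=") == suffix

uri = make_trusty_uri("http://example.org/dataset", b"immutable artifact")
assert verify(uri, b"immutable artifact")
assert not verify(uri, b"tampered artifact")
```

Because the hash is part of the identifier itself, verification needs no trusted third party: a URI and its artifact can be checked against each other by any client, which is what makes the referenced content effectively immutable.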
Performance of the Charniak-Lease parser on biological text using different training corpora
POS tagging is used as the first step in many NLP workflows, although the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To this end we train the Charniak-Lease parser on the WSJ corpus and two biomedical corpora and evaluate its output against MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences in the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.
A web API ecosystem through feature based reuse
The fast-growing web API landscape brings clients more options than ever before, in theory. In practice, they cannot easily switch between different providers offering similar functionality. We discuss a vision for developing web APIs based on reuse of interface parts called features. Through the introduction of five design principles, we investigate the impact of feature-based reuse on web APIs. Applying these principles enables a granular reuse of client and server code, documentation, and tools. Together, they can foster a measurable ecosystem with cross-API compatibility, opening the door to a more flexible generation of web clients.
Advancing discovery science with FAIR data stewardship: Findable, accessible, interoperable, reusable
This report summarizes a presentation by Dr. Michel Dumontier. It reviews innovative scientific research methods created by data science, and the need to develop infrastructure, methodologies, and user communities to advance data science. Stakeholders have proposed a set of principles to make digital resources findable, accessible, interoperable, and reusable (FAIR). FAIR principles provide guidelines, do not require specific technologies, and allow communities of stakeholders to define specific FAIR standards and develop metrics to quantify them. Libraries can be part of the new data ecosystem by providing education, data stewardship, and infrastructure.
A Web API ecosystem through feature-based reuse
The current Web API landscape does not scale well: every API requires its own hardcoded clients in an unusually short-lived, tightly coupled relationship of highly subjective quality. This directly leads to inflated development costs, and prevents the design of a more intelligent generation of clients that provide cross-API compatibility. We introduce 5 principles to establish an ecosystem in which Web APIs consist of modular interface features with shared semantics, whose implementations can be reused by clients and servers across domains and over time. Web APIs and their features should be measured for effectiveness in a task-driven way. This enables an objective and quantifiable discourse on the appropriateness of a certain interface design for certain scenarios, and shifts the focus from creating interfaces for the short term to empowering clients in the long term.
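The "modular interface features with shared semantics" idea can be illustrated with a minimal sketch: a feature bundles a semantic identifier with reusable client logic, so any provider advertising that feature can be consumed by the same code. All names, URLs, and the `Feature` structure below are hypothetical; the paper states design principles, not this API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Feature:
    """A reusable interface part: shared semantics plus client-side logic."""
    semantics: str                             # API-independent identifier
    build_request: Callable[[str, str], str]   # (endpoint, query) -> URL

# One feature implementation, written once...
keyword_search = Feature(
    semantics="http://example.org/features#keyword-search",
    build_request=lambda endpoint, q: f"{endpoint}?q={q}",
)

# ...is reusable against every provider that advertises the feature,
# giving clients cross-API compatibility without hardcoded integrations.
providers = {
    "books": "https://api-a.example/search",
    "music": "https://api-b.example/search",
}

for name, endpoint in providers.items():
    print(name, keyword_search.build_request(endpoint, "reuse"))
```

The point of the sketch is the decoupling: the client depends on the feature's semantics, not on any one provider's interface, which is what lets clients switch between providers offering similar functionality.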
NBLAST: a cluster variant of BLAST for NxN comparisons
BACKGROUND: The BLAST algorithm compares biological sequences to one another in order to determine shared motifs and common ancestry. However, the comparison of all non-redundant (NR) sequences against all other NR sequences is a computationally intensive task. We developed NBLAST as a cluster computer implementation of the BLAST family of sequence comparison programs for the purpose of generating pre-computed BLAST alignments and neighbour lists of NR sequences. RESULTS: NBLAST performs the heuristic BLAST algorithm and generates an exhaustive database of alignments, but it only computes N(N+1)/2 alignments (i.e. the upper triangle) of a possible N^2 alignments, where N is the number of sequences to be compared. A task-partitioning algorithm allows for cluster computing across all cluster nodes and the NBLAST master process produces a BLAST sequence alignment database and a list of sequence neighbours for each sequence record. The resulting sequence alignment and neighbour databases are used to serve the SeqHound query system through a C/C++ and PERL Application Programming Interface (API). CONCLUSIONS: NBLAST offers a local alternative to the NCBI's remote Entrez system for pre-computed BLAST alignments and neighbour queries. On our 216-processor 450 MHz PIII cluster, NBLAST requires ~24 hrs to compute neighbours for 850,000 proteins currently in the non-redundant protein database.
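The upper-triangle saving described above can be sketched in a few lines: because comparing sequence i against j also yields j against i, only N(N+1)/2 of the N^2 pairs need computing, and those pairs can be dealt out to cluster nodes. The round-robin partitioning below is illustrative only, not NBLAST's actual task-partitioning algorithm.

```python
def upper_triangle_tasks(n: int, n_nodes: int):
    """Assign each pair (i, j) with i <= j to a cluster node round-robin.

    Illustrative sketch: covers the upper triangle (diagonal included),
    so exactly n*(n+1)//2 comparisons are scheduled instead of n*n.
    """
    tasks = [[] for _ in range(n_nodes)]
    k = 0
    for i in range(n):
        for j in range(i, n):           # upper triangle only
            tasks[k % n_nodes].append((i, j))
            k += 1
    return tasks

tasks = upper_triangle_tasks(4, 3)
assert sum(len(t) for t in tasks) == 4 * 5 // 2   # N(N+1)/2 = 10 pairs
```

A real scheduler would balance by expected alignment cost rather than pair count (sequence lengths vary widely), but the halving of total work is the same.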